Abstract

Nonconvex optimization problems such as the ones in training deep neural networks
suffer from a phenomenon called saddle point proliferation. This means
that there are a vast number of high error saddle points present in the loss function.
Second order methods have been tremendously successful and widely adopted in
the convex optimization community, while their usefulness in deep learning remains
limited. This is due to two problems: computational complexity and the
methods being driven towards the high error saddle points. We introduce a novel
algorithm specially designed to solve these two issues, providing a crucial first
step to take the widely known advantages of Newton’s method to the nonconvex
optimization community, especially in high dimensional settings.


1 Introduction 

The loss functions arising from learning deep neural networks are highly nonconvex, so the fact
that they can be successfully optimized in many problems remains a partial mystery. However,
some recent work has started to shed light on this issue [?], leading to three likely conclusions:

• There appears to be an exponential number of local minima. 

• However, all local minima lie within a small range of error with overwhelming probability.
Almost all local minima will therefore have similar error to the global minimum. 

• There are exponentially more saddle points than minima, a phenomenon called saddle point 
proliferation. 

These conclusions point to the fact that the low dimensional picture of getting "stuck" in a high
error local minimum is mistaken, and that finding a local minimum is actually a good thing. However,
Newton’s method (the core component of all second order methods) is biased towards finding a
critical point, any critical point. In the presence of an overwhelming number of saddle points, it is
likely that it will get stuck in one of them instead of going to a minimum.

Let $f$ be our loss function, $\nabla f$ and $\mathbf{H}$ be its gradient and Hessian respectively, and $\alpha$ our learning
rate. The step taken by an algorithm at iteration $k$ is denoted by $\Delta\theta_k$. The property of Newton being
driven towards a close critical point can easily be seen by noting that its update equation

$$\Delta\theta_k = -\alpha \, \mathbf{H}(\theta_k)^{-1} \nabla f(\theta_k) \qquad (1)$$

comes from taking a second order approximation of our loss function, and solving for the closest 
critical point of this approximation (i.e. setting its gradient to 0). 
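
For completeness, the derivation behind equation (1) can be written out as follows, taking $\alpha = 1$. The symbol $q_k$ for the quadratic model is introduced here only for illustration and does not appear elsewhere in the text.

```latex
q_k(\Delta\theta) = f(\theta_k) + \nabla f(\theta_k)^{\top} \Delta\theta
                  + \tfrac{1}{2} \, \Delta\theta^{\top} \mathbf{H}(\theta_k) \, \Delta\theta

\nabla_{\Delta\theta} \, q_k(\Delta\theta) = \nabla f(\theta_k) + \mathbf{H}(\theta_k) \, \Delta\theta = 0
\quad \Longrightarrow \quad
\Delta\theta_k = -\mathbf{H}(\theta_k)^{-1} \nabla f(\theta_k)
```

Nothing in this derivation requires the critical point of $q_k$ to be a minimum: when $\mathbf{H}(\theta_k)$ has negative eigenvalues, the same formula lands on a saddle point or a maximum of the quadratic model.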

To overcome this problem of Newton’s method, [?] proposes a different algorithm, called saddle-free
Newton, or SFN. The update equation for SFN is defined as

$$\Delta\theta_k = -\alpha \, |\mathbf{H}(\theta_k)|^{-1} \nabla f(\theta_k) \qquad (2)$$
The absolute value notation in equation (2) means that $|\mathbf{A}|$ is obtained by replacing the eigenvalues of
$\mathbf{A}$ with their absolute values. In the convex case, this changes nothing from Newton. However, in
the nonconvex case, this allows one to keep the very smart rescaling of Newton’s method while still
moving in a descent direction when some eigenvalues are negative.
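
The difference between the two updates can be seen on a two-dimensional toy problem. The sketch below is only an illustration: the loss $f(x, y) = x^2 - y^2$, the starting point, and the helper names (`abs_matrix`, `newton_step`, `saddle_free_newton_step`) are assumptions made here, and the dense eigendecomposition is exactly the expensive operation that the rest of the paper tries to avoid.

```python
import numpy as np

def abs_matrix(H):
    """Return |H|: same eigenvectors as H, eigenvalues replaced by absolute values."""
    eigvals, eigvecs = np.linalg.eigh(H)  # H is assumed real symmetric
    return eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T

def newton_step(H, grad, alpha=1.0):
    """Newton update: -alpha * H^{-1} grad."""
    return -alpha * np.linalg.solve(H, grad)

def saddle_free_newton_step(H, grad, alpha=1.0):
    """Saddle-free Newton update: -alpha * |H|^{-1} grad."""
    return -alpha * np.linalg.solve(abs_matrix(H), grad)

def f(theta):
    # Toy loss with a saddle point at the origin (hypothetical example).
    return theta[0] ** 2 - theta[1] ** 2

theta = np.array([1.0, 0.5])
grad = np.array([2.0 * theta[0], -2.0 * theta[1]])  # gradient of f at theta
H = np.diag([2.0, -2.0])                            # Hessian of f (constant)

theta_newton = theta + newton_step(H, grad)          # jumps onto the saddle
theta_sfn = theta + saddle_free_newton_step(H, grad)  # moves away along y

print(theta_newton, f(theta_newton))
print(theta_sfn, f(theta_sfn))
```

On this example Newton’s method lands exactly on the saddle, while the saddle-free step keeps Newton’s rescaling along the positive-curvature direction and reverses the step along the negative-curvature one, so the loss decreases.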

While saddle-free Newton showed great promise, its main problem is the computational complexity
it carries. Let $m$ be the number of parameters, or more generally, the dimension of $f$’s domain.
The cost of calculating the update in equation (2) is the cost of diagonalizing (and then inverting)
the matrix $|\mathbf{H}|$, namely $O(m^3)$. Furthermore, this has a memory cost of $O(m^2)$, because it needs
to store the full Hessian. Since in neural network problems $m$ is typically bigger than $10^6$, both
costs are prohibitive, which is the reason [?] employs a low-rank approximation. Using a rank $k$
approximation, the algorithm has $O(k^2 m)$ time cost and $O(km)$ memory cost. While this is clearly
cheaper than the full method, it is still intractable for current problems, since the $k$ required to get a
useful approximation becomes prohibitively large, especially for the memory cost.

Another line of work is the one followed by Hessian-free optimization [?], popularly known as HF.
This method centers on three core ideas:

• The Gauss-Newton method, which consists in replacing the Hessian in Newton’s
algorithm (1) with the Gauss-Newton matrix $\mathbf{G}$. This matrix is a positive semidefinite
approximation of the Hessian, and it has achieved a good level of applicability in convex problems.
However, the behaviour when the loss is nonconvex is not well understood. Furthermore,
[?] argues against using the Gauss-Newton matrix on neural networks, showing it suffers
from poor conditioning and drops the negative curvature information, which is argued to
be crucial. Note that this is a major difference with SFN, which leverages the negative
curvature information, keeping the scaling in these directions.

• Using conjugate gradients (CG) to compute $\mathbf{G}(\theta_k)^{-1} \nabla f(\theta_k)$, that is, to solve the linear
system $\mathbf{G}(\theta_k) x = \nabla f(\theta_k)$. One key advantage of CG is that it doesn’t require storing
$\mathbf{G}(\theta_k)$, only calculating matrix-vector products of the form $\mathbf{G}(\theta_k) v$ for any vector $v$. The other
advantage of this method is that it is iterative, allowing for early stopping when the solution to
the system is good enough.

• When using neural networks, the $\mathcal{R}$-operator [?] is an algorithm to calculate matrix-vector
products of the form $\mathbf{H}v$ and $\mathbf{G}v$ in $O(m)$ time without storing any matrix. This is
very efficient, since normally multiplying an m-by-m matrix with a vector has
$O(m^2)$ time and memory cost.
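
The second and third ideas can be sketched together: a CG solver that only sees the matrix through a product callback. This is a minimal sketch under stated assumptions, not the implementation used in Hessian-free optimization. The function name `conjugate_gradient`, the random Jacobian `J`, the damping term, and the use of $\mathbf{J}^\top \mathbf{J}$ as a stand-in for the Gauss-Newton matrix (its form for a squared-error loss) are all choices made here for illustration; in a real network the product would come from the $\mathcal{R}$-operator rather than from an explicit `J`.

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iters=250, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0 initially
    p = r.copy()          # current search direction
    rs_old = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs_old) < tol:   # early stopping once the residual is small
            break
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical stand-in for a Gauss-Newton product: G v = J^T (J v) + damping * v.
rng = np.random.default_rng(0)
J = rng.standard_normal((30, 10))
grad = rng.standard_normal(10)
damping = 1e-3

def gauss_newton_matvec(v):
    # Two matrix-vector products; the 10-by-10 matrix G is never formed.
    return J.T @ (J @ v) + damping * v

x_cg = conjugate_gradient(gauss_newton_matvec, grad)
x_dense = np.linalg.solve(J.T @ J + damping * np.eye(10), grad)
print(np.linalg.norm(x_cg - x_dense))
```

The dense solve at the end is there only to check the result on a small problem; the point of the callback interface is that it is never needed.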

While Hessian-free optimization is computationally efficient, the use of the Gauss-Newton matrix
in nonconvex objectives is thought to be ineffective. The update equation of saddle-free Newton
is specially designed for this kind of problem, but current implementations lack computational
efficiency.

In the following section, we propose a new algorithm that takes the advantages of both approaches. 
This renders a novel second order method that’s computationally efficient, and specially designed 
for nonconvex optimization problems. 

2 Saddle-free Hessian-free Optimization 

A natural idea is to use conjugate gradients to solve the system involving
$|\mathbf{H}|^{-1} \nabla f$ that appears in equation (2). This would allow us to have an iterative method, and possibly
do early stopping when the solution to the system is good enough. However, in order to do that
we would need to calculate $|\mathbf{H}|v$ for any vector $v$. While this was easy with $\mathbf{H}v$ and $\mathbf{G}v$ via the
$\mathcal{R}$-operator, it doesn’t extend to calculating $|\mathbf{H}|v$, so we arrive at an impasse.

The first step towards our new method comes from the following simple but important observation, 
that we state as a Lemma. 

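Lemma 1. Let $\mathbf{H}$ be a real symmetric m-by-m matrix. Then, $|\mathbf{H}|^2 = \mathbf{H}^2$.

Proof. First we prove this for a real diagonal matrix $\mathbf{D}$. We denote $\mathbf{D}_{i,i} = \lambda_i$. By definition, we
have that $|\mathbf{D}|_{i,i} = |\lambda_i|$ and it vanishes on the off-diagonal entries. Therefore, it is trivially verified
that $(|\mathbf{D}|^2)_{i,i} = |\lambda_i|^2 = \lambda_i^2 = (\mathbf{D}^2)_{i,i}$ and both matrices are diagonal, which makes them coincide.

Let $\mathbf{H}$ be a real symmetric m-by-m matrix. By the spectral theorem, there is a real diagonal matrix
$\mathbf{D}$ and an orthogonal matrix $\mathbf{U}$ such that $\mathbf{H} = \mathbf{U}\mathbf{D}\mathbf{U}^{-1}$. Therefore,

$$
\begin{aligned}
|\mathbf{H}|^2 &= (\mathbf{U}|\mathbf{D}|\mathbf{U}^{-1})^2 = \mathbf{U}|\mathbf{D}|\mathbf{U}^{-1}\mathbf{U}|\mathbf{D}|\mathbf{U}^{-1} \\
&= \mathbf{U}|\mathbf{D}|^2\mathbf{U}^{-1} = \mathbf{U}\mathbf{D}^2\mathbf{U}^{-1} \\
&= \mathbf{U}\mathbf{D}\mathbf{U}^{-1}\mathbf{U}\mathbf{D}\mathbf{U}^{-1} = (\mathbf{U}\mathbf{D}\mathbf{U}^{-1})^2 \\
&= \mathbf{H}^2
\end{aligned}
$$

□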

Let $\mathbf{A}$ be a positive (semi-)definite matrix. Recalling that the square root of $\mathbf{A}$ (noted as $\mathbf{A}^{1/2}$) is
defined as the only positive (semi-)definite matrix $\mathbf{B}$ such that $\mathbf{B}^2 = \mathbf{A}$, we have the following
corollary.

Corollary 2.1. Let $\mathbf{H}$ be a real symmetric matrix. Then, $|\mathbf{H}|$ is the square root of $\mathbf{H}^2$. Namely, $|\mathbf{H}| = (\mathbf{H}^2)^{1/2}$.
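
Both statements are easy to check numerically on a small matrix. The snippet below is a sanity check under assumptions of our own (a random 6-by-6 symmetric matrix and dense eigendecompositions); it is not part of the proposed algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
H = (M + M.T) / 2.0   # random real symmetric matrix, generally indefinite

# |H|: keep the eigenvectors of H, take absolute values of the eigenvalues.
eigvals, U = np.linalg.eigh(H)
abs_H = U @ np.diag(np.abs(eigvals)) @ U.T

# Lemma 1: |H|^2 = H^2.
lemma_holds = np.allclose(abs_H @ abs_H, H @ H)

# Corollary 2.1: |H| is the positive semidefinite square root of H^2.
# The square root is built from the eigendecomposition of H^2 itself.
w, V = np.linalg.eigh(H @ H)
sqrt_H2 = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
corollary_holds = np.allclose(sqrt_H2, abs_H)

print(lemma_holds, corollary_holds)
```

The `np.clip` call only guards against eigenvalues of $\mathbf{H}^2$ that round to tiny negative numbers in floating point.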

Note that our main impasse is not knowing how to calculate $|\mathbf{H}|v$ for any vector $v$. However, we
know how to calculate $\mathbf{H}^2 v = \mathbf{H}(\mathbf{H}v)$ by applying the $\mathcal{R}$-operator twice. Therefore, the problem
can be reformulated as: given a positive definite matrix $\mathbf{A}$, of which we know how to calculate $\mathbf{A}u$
for any vector $u$, can we calculate $\mathbf{A}^{1/2} v$ for a given vector $v$?

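The answer to this question is yes. As illustrated by [?], we can define the following initial value
problem:

$$
\begin{cases}
x'(t) = -\tfrac{1}{2} \left( t\mathbf{A} + (1 - t)\mathbf{I} \right)^{-1} (\mathbf{I} - \mathbf{A}) \, x(t) \\
x(0) = v
\end{cases}
\qquad (3)
$$

When the norm of $\mathbf{A}$ is small enough (which can be achieved by a trivial rescaling), one can show that the ordinary
differential equation (3) has the unique solution

$$x(t) = \left( t\mathbf{A} + (1 - t)\mathbf{I} \right)^{1/2} v$$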

This solution has the crucial property that $x(1) = \mathbf{A}^{1/2} v$. Therefore, to calculate $\mathbf{A}^{1/2} v$ we can initialize
$x(0) = v$, and plug equation (3) into an ODE solver such as one of the Runge-Kutta methods.
The second core property of this formulation is that in order to do the derivative evaluations required
to solve the ODE, we only need to multiply by $(\mathbf{I} - \mathbf{A})$ and solve systems by $(t\mathbf{A} + (1 - t)\mathbf{I})$, both
of which can be done only with products of the form $\mathbf{A}u$ without storing any matrix, using conjugate
gradients for the linear systems.
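
The first property, $x(1) = \mathbf{A}^{1/2} v$, can be checked on a small example. The sketch below integrates the initial value problem with the classical fourth-order Runge-Kutta scheme. It makes several assumptions of our own: the function name `sqrt_times_vector`, a fixed number of 20 steps, a test matrix with eigenvalues in $[0.3, 1]$, and, for brevity, a dense `np.linalg.solve` for the inner systems where a matrix-free implementation would use conjugate gradients.

```python
import numpy as np

def sqrt_times_vector(A, v, num_steps=20):
    """Approximate A^{1/2} v by integrating the ODE from t = 0 to t = 1 with RK4."""
    identity = np.eye(A.shape[0])

    def rhs(t, x):
        # x'(t) = -1/2 (tA + (1 - t)I)^{-1} (I - A) x(t)
        shifted = t * A + (1.0 - t) * identity
        return -0.5 * np.linalg.solve(shifted, (identity - A) @ x)

    h = 1.0 / num_steps
    x = np.array(v, dtype=float)   # initial condition x(0) = v
    for i in range(num_steps):
        t = i * h
        k1 = rhs(t, x)
        k2 = rhs(t + h / 2, x + h / 2 * k1)
        k3 = rhs(t + h / 2, x + h / 2 * k2)
        k4 = rhs(t + h, x + h * k3)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Hypothetical test matrix: symmetric positive definite with norm at most 1.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
eigs = np.linspace(0.3, 1.0, 8)
A = Q @ np.diag(eigs) @ Q.T
v = rng.standard_normal(8)

approx = sqrt_times_vector(A, v)
exact = Q @ np.diag(np.sqrt(eigs)) @ Q.T @ v   # reference via eigendecomposition
error = np.linalg.norm(approx - exact)
print(error)
```

Eigenvalues of $\mathbf{A}$ close to zero make the system at $t = 1$ ill-conditioned, so a practical implementation would likely need damping or adaptive step sizes; that is outside the scope of this sketch.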

In order to solve our problem of approximating SFN in a Hessian-free way, we could calculate
$|\mathbf{H}|v$ using $\mathbf{A}u := \mathbf{H}^2 u$ in the previous method and do conjugate gradients to solve the system in
(2). However, this would require solving an ODE for every iteration of conjugate gradients, which
would be quite expensive. Therefore, we propose to calculate update (2) in a two-step manner. First,
we multiply by $|\mathbf{H}|$ and then we divide by $\mathbf{H}^2 = |\mathbf{H}|^2$:

$$y \leftarrow |\mathbf{H}(\theta_k)| \, \nabla f(\theta_k)$$

$$\Delta\theta_k \leftarrow -\alpha \left( \mathbf{H}(\theta_k)^2 \right)^{-1} y$$

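Combining this approach with our approximation schemes, we derive our final algorithm, which we
call saddle-free Hessian-free optimization:

$$y \leftarrow \text{ODE-solve}\left( \text{Equation (3)}, \ \mathbf{A}u := \mathbf{H}(\theta_k)^2 u, \ v = \nabla f(\theta_k) \right)$$

$$\Delta\theta_k \leftarrow \text{CG-Solve}\left( \mathbf{H}(\theta_k)^2, \ -\alpha y \right)$$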
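
The two steps can be put together in a short matrix-free sketch. Everything below is an illustration under assumptions of our own rather than a reference implementation: the names `cg_solve`, `ode_sqrt_matvec` and `sfhf_step`, the fixed RK4 scheme with 20 steps, the tolerances, and the small test Hessian, which is built with spectral norm 1 and no eigenvalue near zero. The callback `hess_matvec` stands in for the $\mathcal{R}$-operator.

```python
import numpy as np

def cg_solve(matvec, b, max_iters=250, tol=1e-12):
    """Solve A x = b for symmetric positive definite A, given only u -> A u."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs_old) < tol:
            break
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def ode_sqrt_matvec(A_matvec, v, num_steps=20):
    """Approximate A^{1/2} v with RK4 on the ODE, using only products A u."""
    def rhs(t, x):
        shifted = lambda u: t * A_matvec(u) + (1.0 - t) * u   # (tA + (1 - t)I) u
        return -0.5 * cg_solve(shifted, x - A_matvec(x))      # rhs uses (I - A) x
    h = 1.0 / num_steps
    x = np.array(v, dtype=float)
    for i in range(num_steps):
        t = i * h
        k1 = rhs(t, x)
        k2 = rhs(t + h / 2, x + h / 2 * k1)
        k3 = rhs(t + h / 2, x + h / 2 * k2)
        k4 = rhs(t + h, x + h * k3)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

def sfhf_step(hess_matvec, grad, alpha=1.0):
    """One saddle-free Hessian-free step: y = |H| grad, then solve H^2 d = -alpha y."""
    h2_matvec = lambda u: hess_matvec(hess_matvec(u))   # H^2 u = H (H u)
    y = ode_sqrt_matvec(h2_matvec, grad)                # y approximates |H| grad
    return cg_solve(h2_matvec, -alpha * y)

# Hypothetical indefinite Hessian with known spectrum, used only for checking.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigs = np.array([-0.9, -0.7, -0.6, 0.6, 0.8, 1.0])
H = Q @ np.diag(eigs) @ Q.T
grad = rng.standard_normal(6)

step = sfhf_step(lambda u: H @ u, grad)
exact = -(Q @ np.diag(1.0 / np.abs(eigs)) @ Q.T) @ grad   # dense SFN step, alpha = 1
error = np.linalg.norm(step - exact)
print(error)
```

The dense matrix `H` appears only to define the test problem and the reference step; `sfhf_step` itself touches the Hessian exclusively through `hess_matvec`, so its working memory is a handful of vectors of size $m$.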

If $l$ is the number of Runge-Kutta steps we take to solve the ODE (3), and $k$ is the number of CG
iterations used to solve the linear systems, then the overall cost of the algorithm is $O(mlk)$. Since
$l$ is close to 20 in the successful experiments done by [?] on random matrices (independently of
$m$), and $k$ is no larger than 250 in typical Hessian-free implementations, this is substantially lower
than the $O(m^3)$ cost of saddle-free Newton. Furthermore, one critical advantage is that the memory
cost of the algorithm is $O(m)$, since at no point is it required to store more than a small constant
number of vectors of size $m$.
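
As a rough order-of-magnitude illustration, take the figures quoted above ($m = 10^6$, $l = 20$, $k = 250$) and ignore constant factors:

```latex
m \, l \, k = 10^{6} \cdot 20 \cdot 250 = 5 \times 10^{9},
\qquad
m^{3} = 10^{18},
\qquad
\frac{m^{3}}{m \, l \, k} = 2 \times 10^{8}.
```

On the memory side, storing the full Hessian would take $m^2 = 10^{12}$ entries, against a few vectors of $10^6$ entries each.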
3 Conclusion and Future Work


We presented a new algorithm called saddle-free Hessian-free optimization. This algorithm provides 
a first step towards merging the benefits of computationally efficient Hessian-free approaches and 
methods like saddle-free Newton, which are specially designed for nonconvex objectives. 

Further work will be focused on taking these ideas to real world applications, and adding more speed 
and stability improvements to the core algorithm, such as the preconditioners of [?] and damping
with Levenberg-Marquardt [?] style heuristics.